Comparison of IRT Observed-Score and True-Score 'Equatings'*

Abstract 

Two methods of 'equating' tests using item response theory are
compared, one using true scores, the other using the estimated
distribution of observed scores. On the data studied, they yield almost
indistinguishable results. This is a reassuring result for users of 
IRT equating methods. 


Most IRT equating is currently attempted by the true-score equating
procedure described in Lord (1980, Chapter 13). Lord also describes
an IRT observed-score procedure, which until now seems not to have been
further investigated, perhaps because it is more complicated and more
expensive than the true-score procedure. The present article reports
an empirical research study comparing the results of applying these two
procedures to real test data.

Sections 1 and 2 outline the true-score and the observed-score 
procedures, respectively. Section 3 discusses the theoretical 
advantages and disadvantages of each procedure. Section 4 describes
the real test data used to provide a comparison of the two methods. 
Section 5 describes the procedures for estimating item and ability 
parameters. Section 6 reports and summarizes the empirical results. 

*This work was supported in part by contract N00014-80-C-0402,
project designation NR 150-453, between the Office of Naval Research and
Educational Testing Service. Reproduction in whole or in part is
permitted for any purpose of the United States Government.

Item response theory models the probability of a correct response
by an examinee to a test item as a monotonically increasing function of
ability. The model used here is Birnbaum's three-parameter logistic
model given by the following formula:

$$P_i(\theta_a) = c_i + \frac{1 - c_i}{1 + \exp(-1.7\,a_i(\theta_a - b_i))} \tag{1}$$
where

$P_i(\theta_a)$ is the probability of examinee $a$ getting item $i$ correct;

$b_i$ is the difficulty of item $i$;

$a_i$ is the discrimination index for item $i$;

$c_i$ is the lower asymptote for item $i$;

$\theta_a$ is the ability of examinee $a$ ($-\infty < \theta_a < \infty$).

$P_i(\theta_a)$ has a minimum of $c_i$ and a maximum of 1. This model assumes
that the test is unidimensional.
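Equation (1) can be evaluated directly. The following is a minimal sketch; the function name `p3pl` and its argument names are illustrative and do not come from the original report. Only the scaling constant 1.7 is taken from (1).

```python
import math


def p3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the three-parameter
    logistic model of equation (1).

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote. D is the scaling constant 1.7 from (1).
    Note: math.exp overflows for very large negative theta - b,
    so this sketch assumes moderate parameter values.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
```

When theta equals b, the exponential term is 1 and the probability lies halfway between the lower asymptote c and 1.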

1. True-Score Equating

Since the expected score of examinee $a$ on item $i$ is $P_i(\theta_a)$, the
examinee's expected number of right answers is $\sum_i P_i(\theta_a)$. In
classical test theory, this expectation is called the (number-right) true
score, $\xi_a = \sum_i P_i(\theta_a)$. For the moment, we do not deal with the
scores of particular examinees, so the subscript $a$ will be dropped. Here
the true score for test X containing n items is the mathematical
variable

$$\xi = \sum_{i=1}^{n} P_i(\theta), \tag{2}$$

a monotonic increasing function of $\theta$. If test Y contains m items
and measures the same ability $\theta$ as test X, the true score on test
Y is the mathematical variable

$$\eta = \sum_{j=1}^{m} P_j(\theta). \tag{3}$$



The variables $\xi$, $\eta$, $\theta$ are all measures of the same
psychological trait; they differ only in the numerical scale on which the
measurements are expressed. Thus true scores $\xi = \xi_0$ and $\eta = \eta_0$
corresponding to any given $\theta = \theta_0$ represent identical levels of
ability. Any examinee whose true score on test X is $\xi_0$ must
automatically have a true score on test Y of exactly $\eta_0$, provided
the IRT model holds. The situation is the same as when we say that 32°
Fahrenheit has the same meaning as 0° Celsius, except that these
temperature scales have a linear relationship, whereas the true-score
scales have a nonlinear relationship. Thus, $\xi_0$ and $\eta_0$ are equated
true scores; this is true in a much stronger sense than is usually
implied by the term equated.

In IRT true-score equating, estimated item parameters are
substituted into (2) and (3) and a table of corresponding values of $\xi$
and $\eta$ is calculated. This constitutes the true-score equating
table. This table is then applied in practice as if the true scores
were observed number-right scores. Since observed scores have different
properties than true scores, this last step has no clear theoretical
justification. It is done as a practical procedure, to be justified
only by whatever usefulness and reasonableness can be empirically
demonstrated for the results.
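The construction of a true-score equating table can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually programmed for the study: items are given as hypothetical `(a, b, c)` tuples, the monotonic function (2) is inverted by simple bisection over an assumed ability range of -10 to 10, and only integer true scores above the chance level (the sum of the lower asymptotes) and below n are tabulated.

```python
import math


def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function, equation (1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def true_score(theta, items):
    """Equation (2) or (3): sum of the item response functions at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def theta_for_true_score(target, items, lo=-10.0, hi=10.0, tol=1e-10):
    """Invert the monotonic true-score function by bisection.

    The search interval [lo, hi] is an assumption of this sketch.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if true_score(mid, items) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def true_score_equating_table(items_x, items_y):
    """Map each integer true score on X (above chance, below n) to the
    true score on Y that corresponds to the same theta."""
    chance = sum(c for _, _, c in items_x)
    table = {}
    for xi in range(math.floor(chance) + 1, len(items_x)):
        theta = theta_for_true_score(xi, items_x)
        table[xi] = true_score(theta, items_y)
    return table
```

If the two forms contain identical items, the table reduces to the identity; if form Y is harder, each tabulated true score on Y is lower than the corresponding true score on X.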

2. IRT Observed-Score Equating 

If the assumptions of IRT hold (as is assumed throughout), the
probability that an examinee of ability $\theta$ will have a number-right
score of x = 1 on a two-item test is $P_1 Q_2 + Q_1 P_2$, where
$P_i \equiv P_i(\theta)$ and $Q_i \equiv 1 - P_i$. The probability that this
examinee's score is 0 or 2 is $Q_1 Q_2$ or $P_1 P_2$, respectively. These
probabilities constitute the conditional frequency distribution
$f_2(x|\theta)$.

If a third item is added to this test, the distribution of x
is now

$$f_3(x|\theta) = Q_3 f_2(x|\theta) + P_3 f_2(x-1|\theta) \qquad (x = 0, 1, \ldots, 3),$$

where $f_r(x|\theta) = 0$ if $x < 0$ or $x > r$. Using this recursive
procedure, a computer can readily determine $f_n(x|\theta)$, even for an n
of several hundred.
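The recursion can be written compactly. The sketch below is illustrative only; the function name `score_distribution` is hypothetical, and the input is assumed to be the list of item probabilities for one fixed ability value.

```python
def score_distribution(probs):
    """Conditional distribution of the number-right score x for one
    ability value, given probs = [P_1(theta), ..., P_n(theta)].

    Returns a list f with f[x] = probability of exactly x right answers,
    built by adding one item at a time as in the recursion above.
    """
    f = [1.0]  # with zero items, the score is 0 with probability 1
    for p in probs:
        q = 1.0 - p
        g = [0.0] * (len(f) + 1)
        for x, fx in enumerate(f):
            g[x] += q * fx      # new item answered incorrectly
            g[x + 1] += p * fx  # new item answered correctly
        f = g
    return f
```

Each added item costs work proportional to the current test length, so the whole distribution for an n-item test takes on the order of n squared operations per ability value.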

If the $\theta$ of each examinee is known, the (marginal) distribution
of x for a group of N examinees is

$$f_n(x) = \frac{1}{N} \sum_{a=1}^{N} f_n(x|\theta_a). \tag{4}$$

If an m-item test Y yields number-right score y and measures the
same ability as test X, then the (marginal) distribution of y for
a group of M examinees is

$$g_m(y) = \frac{1}{M} \sum_{a=1}^{M} g_m(y|\theta_a). \tag{5}$$

A monotonic transformation of the y scores can now be found from (4)
and (5) such that the distribution of the transformed y scores is the
same as the distribution of the (untransformed) x scores, except for
irregularities due to the fact that x and y can only assume integer
values. This is done by finding, for each y score, the x that has
the same percentile rank in (4) that y has in (5). The x so found
is the desired transformed y score.
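The percentile-rank matching can be sketched as follows. The text does not say how percentile ranks of integer scores are defined or interpolated; this sketch assumes the common convention of spreading each integer score uniformly over the interval from half a point below it to half a point above it. All function names are hypothetical.

```python
def marginal(cond_dists):
    """Average a list of conditional score distributions (one per
    examinee) into a marginal distribution, as in (4) and (5)."""
    n_scores = len(cond_dists[0])
    total = [0.0] * n_scores
    for dist in cond_dists:
        for x, fx in enumerate(dist):
            total[x] += fx
    return [t / len(cond_dists) for t in total]


def percentile_rank(dist, score):
    """Percentile rank (on a 0-1 scale) of an integer score: the mass
    below the score plus half the mass at the score (an assumed
    convention, not taken from the report)."""
    return sum(dist[:score]) + 0.5 * dist[score]


def score_at_rank(dist, rank):
    """Inverse of percentile_rank: the (possibly fractional) score whose
    percentile rank equals rank, using linear interpolation."""
    cum = 0.0
    for x, fx in enumerate(dist):
        if fx > 0.0 and cum + fx >= rank:
            return x - 0.5 + (rank - cum) / fx
        cum += fx
    return len(dist) - 0.5


def equipercentile(dist_x, dist_y):
    """For each integer y score, the x score with the same percentile rank."""
    return [score_at_rank(dist_x, percentile_rank(dist_y, y))
            for y in range(len(dist_y))]
```

With identical distributions for x and y the transformation is the identity, which gives a quick check on any implementation.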

If the examinees who took test Y have the same distribution of
$\theta$ as the examinees who took test X, then the resulting transformation
of y is an 'equipercentile equating' of the y scale to the x
scale. Within groups similar to the groups used to derive the
transformation, it has the valuable property that if a cutting score
is chosen on the x scale and the same cutting score is used on the
transformed y scale, the proportion of test X examinees selected
will be the same as the proportion of test Y examinees selected. This
property is essential if test X and test Y examinees are both to be
treated equitably, so that an examinee cannot complain that he was
injured by the choice of test administered.

When the groups taking tests X and Y are known to have
approximately the same distribution of $\theta$ (for example, they are two
random samples from the same population), there is no reason to use IRT.
It is much simpler to do the equipercentile equating using the actual
sample distributions of x and y, instead of (4) and (5). The need
for IRT arises when the ability distributions of the two groups may
differ. In this case, IRT may allow us to estimate the (marginal)
frequency distributions of number-right scores that would have resulted
if all examinees had taken both tests, without practice or fatigue
effects.

In order to do this, the item and ability parameters in (4) and (5) 
must all be on the same scale. This is usually accomplished by 
administering a suitable 'anchor test' to both groups of examinees. All 
answer-sheet responses for both groups are used in a single computer 
run that estimates all parameters on the same scale. These estimates 
are then used in (4) and (5), substituting N + M for N or M , to 
obtain the distributions of x and y for the combined group of N + M 
examinees. Equipercentile equating of y to x is then carried out 
in the usual way. 

3. Theoretical Perspectives

Practical workers, with the need for equating scores on two
different test forms, have over the years used widely different methods
(see Angoff, 1971) in an attempt to approximate the desired result.
Each practical worker, needing a word to describe his results, asserts
that he has produced an equating of y to x. Yet different methods
and different groups do not produce identical 'equatings'.

Braun and Holland (1982, page 14) state: "There is some
disagreement over what test equating is and the proper method for doing
it." They then adopt the definition "Form-X and Form-Y are equated on
[population] P" if the distribution of the transformed y scores in
population P is the same as the distribution of the (untransformed) x
scores.

This definition of the phrase 'equated on population P' is
beyond reproach. One problem, however, is that the qualifying phrase
'on population P' is typically dropped by the practical worker who
writes a research report or publishes an equating table in a test
manual.

Unfortunately (as will be shown later in this section) two tests
that are equated on population P will typically not be equated for
various subpopulations that are included in P. Test scores that are
equated for the population of college applicants may well be equated
neither for the population of female college applicants, nor for
the population of male college applicants. The scores are still less
likely to be equated for a subpopulation characterized by interest in
science, or in music. For the subpopulation of Harvard applicants, the
situation is much worse.

If the proportion of applicants admitted to Harvard differs
significantly depending on whether they were given form X or
form Y of the test, it is clear that the 'equating' was unsuccessful.
Since similar inequalities are likely to characterize any equating
on any specified population, it may be best not to say that the tests
are 'equated' at all, or to simply say that they are 'approximately
equated.'

From a practical point of view, the approximation may be quite 
satisfactory for many subgroups. It is unlikely, however, that the 
equating will be adequate for any subpopulation having a mean and 
variance of ability that is sharply different from the mean and variance 
of the total population used to derive the equating transformation. 
Extensive practical data illustrating the adequacies and the
inadequacies of approximate equatings are given in the 30-volume
Anchor Test Study (Loret, Seder, Bianchini, and Vale, 1974).

For a theoretical discussion of alternative equating methods, 
however, it is important not to start out with a definition of equating 
that is clearly inadequate for subpopulations of examinees. Given that 
the IRT model holds, IRT observed-score equating would, for example, be 
automatically endorsed by the Braun and Holland definition, since their 
definition mandates equipercentile equating. IRT true-score equating
would be definitely rejected by their definition, since in general it
will not lead to x scores and transformed y scores having the same
frequency distribution, unless X and Y are strictly parallel forms
that are identical in difficulty, in reliability, and also in most other
respects.

The important virtue of IRT true-score equating is that if the IRT 
model holds, the true scores are clearly equated for all subpopulations 
of examinees. This results from the invariance of IRT parameters across 
populations of examinees, assumed by the IRT model. The clear flaw in 
IRT true-score equating is that it equates true scores, not the 
actually observed fallible scores. Treating observed scores as if they 
were true scores cannot be justified on any theoretical grounds. 

The virtue of IRT observed-score equating is that in a group like 
that used to derive the equating, any cutting score will accept the 
same percentage of examinees regardless of the test administered. The 
flaw is that this holds only for that total group and not for other 
groups or subgroups. 

This last statement is most clearly seen from a very extreme
example. Suppose forms X and Y have the same number of items,
measure the same ability $\theta$, but differ in difficulty. If the
equipercentile equating is carried out on a group of examinees all
of whom are guessing at random on almost all the items, the difference
in difficulty between forms will not be apparent, and the
equipercentile equating will approximate an identity transformation of
score y. If a slightly more competent group of examinees is used for
the equipercentile equating, however, the difference in difficulty
between forms will begin to become apparent and most y scores will be
adjusted upwards or downwards accordingly. As the competence of the
group used becomes higher and higher, the equating transformation found
will differ more and more from the identity transformation found from
the original extreme group.

As a second example of the inescapable invalidity of observed-score
equating, suppose that tests X and Y are of equal difficulty
and that the true scores $\xi$ and $\eta$ have equal variance, but that
y is much less reliable than x. Consider a subgroup of very
talented examinees; to make the illustration clear, consider that in
this subgroup all examinees have nearly identical $\theta$ values. Most of
the variation in observed scores x and y is now due to errors of
measurement. The equipercentile equating transformation found will thus
approximate a straight line with slope

$$\frac{\text{standard deviation of the errors of measurement in } x}{\text{standard deviation of the errors of measurement in } y}.$$

Since y is much less reliable than x, the slope will be much
less than 1.

If, on the other hand, the equipercentile equating transformation
is found from a group where the true-score variance is large compared to
the error variances, the transformation will tend to approximate a
straight line with slope

$$\frac{\text{standard deviation of true scores on } x}{\text{standard deviation of true scores on } y}.$$

Intermediate situations will provide transformations with intermediate
slopes. If the wrong equating is applied to any given subpopulation,
the proportion of examinees in the subpopulation accepted will depend
on whether they took test X or test Y, an inequitable result.
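A brief sketch of why these two limiting slopes arise, under two assumptions that are not stated in the text: the equipercentile transformation is close to linear (as it is when the two score distributions have roughly the same shape), and the classical decomposition of observed-score variance into true-score variance plus error variance applies within the group used. Writing $\sigma_\xi$, $\sigma_\eta$ for the true-score standard deviations and $\sigma_{e_x}$, $\sigma_{e_y}$ for the error standard deviations in that group, the slope of the line equating y to x is approximately

$$\frac{\sigma_x}{\sigma_y} = \sqrt{\frac{\sigma_\xi^2 + \sigma_{e_x}^2}{\sigma_\eta^2 + \sigma_{e_y}^2}}.$$

As $\sigma_\xi$ and $\sigma_\eta$ approach 0 (all examinees at nearly the same $\theta$), this ratio tends to $\sigma_{e_x}/\sigma_{e_y}$. When the true-score variances dominate the error variances, it tends to $\sigma_\xi/\sigma_\eta$, which equals 1 in this example because the true scores were assumed to have equal variance.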

Our theoretical position, then, is that each method described in
Sections 1 and 2 (as well as all other available equating methods) has
its own inadequacies. Since, in practice, some (approximate) equating
method must be used, it will be informative to investigate empirically
how the two methods of Sections 1 and 2 compare in a specially contrived
practical situation where the correct equating is actually known in
advance.



4. Data



These two equating methods were used to equate the chain of six SAT
verbal tests described by Petersen, Cook, and Stocking in the report
IRT Versus Conventional Equating Methods: A Comparative Study of Scale
Stability. The tests in this chain were selected such that the first
test and the last test are the same. Each test is equated to the next
test in the chain using an anchor test. Figure 1 is a diagram of the
chain. The capital letters represent the test form, the small letters
represent the anchor test. Scores on form V4 are equated to scores on
form X2 using the anchor test fe. These equated scores on X2 are
equated to scores on form Y3 using the anchor test fm. This gives
us an equating of form V4 to Y3. In this manner, one proceeds
through the chain, with the final equating of Z5 to V4 giving us a
table of scores on the original V4 equated to the scores on the V4 at
the end of the chain. Any deviation from equality between the two sets
of scores could be attributable to scale drift or lack of model fit.

    Box 1:  V4, fe  |  fe, X2
    Box 2:  X2, fm  |  fm, Y3
    Box 3:  Y3, fw  |  fw, B3
    Box 4:  B3, fk  |  fk, Y2
    Box 5:  Y2, fu  |  fu, Z5
    Box 6:  Z5, et  |  et, V4

Figure 1. Chain of six SAT verbal equatings. Upper case letters
designate test forms; lower case letters designate
anchor tests.
Each form in the chain has 85 items except form V4, which has 90
items. Each anchor test has 40 items. For each form there are two
samples of examinees, each sample taking a different anchor test. The two
groups taking each form were random samples from the same population for
all of the forms except Y3. For the parameter estimation runs, a
random sample of approximately 2670 examinees was selected from the data
obtained at the test administration of that form and anchor test.

5. Parameter Calibration 

The item parameters and abilities were estimated by a modified
version of the computer program LOGIST (Wood, Wingersky, & Lord, 1976)
in six separate calibration runs. In Figure 1, each box (containing two
forms and one anchor test) represents one LOGIST run. The item
responses for items not taken by an examinee, such as the X2 items for
examinees taking form V4 in box 1, are treated as not-reached items.

All of the estimated parameters within each LOGIST run are on the 
same scale and either method of equating can be used to equate the 
scores for the two tests. The anchor tests are not used directly in the
equating, but are used in LOGIST so that the estimated parameters
within a LOGIST run are on the same scale.
6. Results

In using the IRT observed-score equating method, two estimated
distributions of observed scores are equated so that the transformed y
scores and the (untransformed) x scores have the same distribution.
Figure 2 is presented to demonstrate that, at least for one set of data,
this estimated distribution of observed scores is a reasonable fit to
the actual distribution of observed scores. The frequencies are
plotted against formula scores, which are the number right minus a
fraction of the number wrong. The fraction is one over the number of
choices. Since the estimated observed-score distribution can only be
obtained for number-right scores, the transformation to formula scores
assumes that there are no omits, that is, that the number wrong is the
total number of items minus the number right. In order to compare the
two distributions, the observed-score distribution should be based on a
group that has no omits. Consequently, a form of the SAT verbal
different from the ones in the chain was used for this figure in order
to get a sufficiently large sample for the frequency distribution
and for the item calibration.

The agreement shown in Figure 2 is good except that the tails of
the estimated distribution are too high. This discrepancy is presumably
due to the use of estimated $\theta$ in place of true $\theta$ for the
practical implementation of (4). Since a similar discrepancy affects
the estimated observed-score distributions of both test X and test
Y, the effects of the discrepancies tend to cancel out in the equating
process.

In our chain-equating study, each method of equating was applied
separately to the whole chain of equatings, resulting in a line for each
method equating form V4 at the beginning of the chain to form V4 at the
end of the chain. These two lines are plotted in Figure 3 along with a
45° line. The solid line is the IRT true-score equating line; the
dotted line, falling practically on top of the solid line, is the IRT
observed-score equating line. To equate scores below chance level,
that is $\sum_{i=1}^{n} c_i$, for the IRT true-score line, the method given on
pages 210-211 of Lord (1980) was used. For scores above 0, the maximum
difference between the two equatings was .2; for scores below 0, the
maximum difference was .8, which occurred at the chance level. If the
equating methods were perfect and there were no scale drift, the
equating line would be the dashed 45° line.

Figure 4 shows the two equating methods applied to one individual 
link in the chain. This particular link was selected because the IRT 
true-score equating line between these two forms had the greatest 
discontinuity in the slope at the chance level. The largest difference 
between the two lines occurred at the chance level and was 1.6. For 
scores above 0 the maximum difference between the two lines was .4. 



[Figures 2 to 4 are not reproduced here. The recoverable axis labels
are "FORMULA SCORE - ORIGINAL" and "FORMULA SCORE - OLD FORM".]

Given that there is no clear theoretical justification for applying 
IRT true-score equating to observed scores and that the equipercentile 
equating of the IRT observed-score distributions is population 
dependent, the close agreement between the two lines is reassuring. 